Content on this page: Data Interpretation, Measure of Dispersion, Skewness and Kurtosis, Percentile Rank and Quartile Rank, Correlation.
Chapter 10 Descriptive Statistics (Concepts)
Welcome to this essential chapter on Descriptive Statistics, a fundamental branch of statistics concerned with the crucial initial steps of data analysis: summarizing, organizing, and describing the main features of a dataset. In the realm of Applied Mathematics, raw data is often abundant but initially chaotic; descriptive statistics provides the indispensable tools to transform this raw information into understandable summaries and meaningful insights, laying the groundwork for further analysis, interpretation, and informed decision-making. This chapter will equip you with the necessary techniques to effectively explore datasets, identify patterns, and communicate key characteristics through tables, graphs, and numerical measures.
Our journey begins with understanding different types of data (such as quantitative vs. qualitative, discrete vs. continuous) and exploring effective methods for organizing and presenting it. Raw data is often unwieldy, so we focus on constructing frequency distribution tables, both for ungrouped and grouped data. This involves defining appropriate class intervals, determining class marks (midpoints), and calculating frequencies and cumulative frequencies. Building upon tabular presentation, we explore powerful graphical representations that provide visual insights into data patterns. These include familiar tools like bar graphs, but also more sophisticated methods like histograms (which accurately depict frequency distributions for continuous data, potentially including techniques for handling unequal class widths), frequency polygons, and ogives (cumulative frequency curves – both 'less than' and 'more than' types). Emphasis is placed not just on constructing these graphs, but critically, on their correct interpretation.
While presentation gives structure, numerical summaries provide concise descriptions of data characteristics. We delve deeply into Measures of Central Tendency, which aim to locate the 'center' or typical value of a dataset. The three primary measures covered are:
- The Mean ($\bar{x}$): The arithmetic average, calculated using the Direct Method, or more efficiently for complex grouped data using the Assumed Mean Method or the Step-Deviation Method.
- The Median ($M$): The middle value when data is ordered (robust to outliers). For grouped data, it's calculated using the formula $M = l + [\frac{(\frac{N}{2})-cf}{f}]\times h$, where $l, N, cf, f, h$ represent the lower limit, total frequency, cumulative frequency of the preceding class, frequency of the median class, and class width, respectively.
- The Mode ($M_0$): The most frequently occurring value. For grouped data, the modal class is identified, and the mode is estimated using the formula $M_0 = l + [\frac{f_1-f_0}{2f_1-f_0-f_2}]\times h$, where $f_1$ is the frequency of the modal class and $f_0, f_2$ are frequencies of the preceding and succeeding classes.
Central tendency alone provides an incomplete picture. We must also understand the data's spread or variability using Measures of Dispersion. These quantify how scattered the data points are around the central value. We explore:
- The Range: Simplest measure (Maximum - Minimum), but highly sensitive to extreme values.
- Quartile Deviation: Based on the first ($Q_1$) and third ($Q_3$) quartiles, using the Interquartile Range (IQR = $Q_3 - Q_1$). It's less affected by outliers.
- Mean Deviation: The average of the absolute deviations from either the mean or the median ($\frac{1}{N}\sum\limits_{i=1}^{k} f_i |x_i - \bar{x}|$ or $\frac{1}{N}\sum\limits_{i=1}^{k} f_i |x_i - M|$ for grouped data).
- Variance ($\sigma^2$) and Standard Deviation ($\sigma$): The most important measures, based on the squared deviations from the mean. The standard deviation ($\sigma = \sqrt{\text{Variance}}$) represents the typical deviation. For grouped data, shortcut formulas like the step-deviation method $\sigma^2 = h^2 [\frac{\sum\limits_{i=1}^{k} f_i u_i^2}{N} - (\frac{\sum\limits_{i=1}^{k} f_i u_i}{N})^2]$ (where $N=\sum\limits_{i=1}^{k} f_i$) are often used for computational efficiency.
Finally, to compare the variability of datasets that might have different means or units, we introduce the concept of relative dispersion using the Coefficient of Variation (CV). Calculated as $CV = \frac{\sigma}{|\bar{x}|} \times 100$, it expresses the standard deviation as a percentage of the mean, allowing for meaningful comparisons of consistency or stability between different datasets. This chapter provides a comprehensive toolkit for the essential first steps in any data analysis endeavor: exploration, summary, and description.
Data Interpretation
Descriptive Statistics is a fundamental area of statistics that focuses on methods for collecting, organizing, summarizing, and presenting data so that its key features can be easily understood. It provides simple summaries about the sample and the observations that have been made. These summaries can be numerical (like mean, median, variance) or graphical (like histograms, bar charts).
Data Interpretation is a crucial component of descriptive statistics. It is the process of reviewing, analysing, and assigning meaning to collected data to arrive at relevant conclusions and understand the implications of the findings. Raw data, in its original form, is usually a mass of numbers or information that is difficult to grasp. Interpretation transforms this raw data into meaningful insights, making it useful for decision-making, identifying trends, or understanding phenomena.
Importance of Data Interpretation
Effective data interpretation is essential in virtually every field – from business and economics to science and social studies. Its importance stems from its ability to:
Aid Decision Making: By summarizing and clarifying data, interpretation provides a solid basis for making informed decisions, predicting outcomes, and planning strategies.
Reveal Trends and Patterns: Interpretation helps in uncovering underlying patterns, trends, and relationships within the data that might not be apparent otherwise, such as seasonal sales patterns or changes in consumer behaviour.
Facilitate Communication: Well-interpreted data, presented visually or numerically, makes complex information accessible and understandable to a wider audience, enabling clear communication of findings.
Measure Performance: It allows for the evaluation of performance against goals, benchmarks, or historical data, which is critical for assessment and improvement in areas like business, education, or public health.
Build Foundation for Further Analysis: Descriptive statistics and interpretation are often the first step before applying more advanced statistical techniques (inferential statistics) to test hypotheses or make broader generalizations about a population.
Types of Data
Data can be classified in various ways. A common classification relevant to descriptive statistics distinguishes between quantitative and qualitative data:
1. Quantitative Data:
This type of data consists of numerical values that represent quantities. Quantitative data can be measured or counted.
Discrete Data: Data that can take only a finite number of values or a countably infinite number of values. These are typically obtained by counting.
Examples: The number of students in a classroom ($20, 25, 30$), the number of cars passing a point on a road ($0, 1, 2, \dots$), the number of defective items in a batch. These values are usually integers and cannot take values in between.
Continuous Data: Data that can take any value within a given range. These are typically obtained by measurement.
Examples: Height of a person ($165.5$ cm, $170.25$ cm), weight of an object ($50.3$ kg, $50.345$ kg), temperature ($30^\circ\text{C}$, $30.5^\circ\text{C}$), time taken to complete a task. These values can theoretically take any value within an interval.
2. Qualitative Data (Categorical Data):
This type of data represents categories, attributes, or non-numerical properties. While categories might sometimes be coded numerically, the numbers themselves do not have mathematical meaning in terms of magnitude or order (unless it's ordinal data).
Nominal Data: Data that represent categories with no intrinsic order or ranking.
Examples: Gender (Male, Female, Other), Blood Group (A, B, AB, O), Colour of eyes (Blue, Brown, Green), Marital Status (Single, Married, Divorced, Widowed).
Ordinal Data: Data that represent categories with a natural order or ranking, but the differences between categories are not necessarily uniform or quantifiable.
Examples: Education Level (High School, Graduate, Postgraduate), Rating Scales (Poor, Fair, Good, Excellent), Socioeconomic Status (Low, Medium, High), Competition Ranks (1st, 2nd, 3rd).
Understanding the type of data is crucial as it determines the appropriate methods for organization, presentation, and statistical analysis.
Methods of Data Representation
Raw data needs to be organized and presented effectively to facilitate interpretation. The primary methods include tabular and graphical presentations.
1. Tabular Presentation (Frequency Distribution):
Organizing data into tables is one of the most basic and effective ways to summarize information. A Frequency Distribution Table lists the unique values or categories of a variable and their corresponding frequencies (the number of times each value/category appears in the dataset).
For large datasets, especially continuous data, data is grouped into Class Intervals, and the table shows the frequency of observations falling within each interval.
Components of a Frequency Distribution Table (for Grouped Data):
Class Intervals: These are the ranges into which the data is grouped (e.g., $0-10$, $10-20$, $20-30$). They should ideally be mutually exclusive (an observation falls into only one interval) and exhaustive (all observations fall into some interval). Intervals can be inclusive (e.g., $0-9, 10-19$) or exclusive (e.g., $0-10, 10-20$, where 10 belongs to the second interval).
Tally Marks: Used as an intermediate step in manual frequency counting. A stroke is made for each observation falling into a class, and every fifth stroke crosses the preceding four ($\bcancel{||||}$).
Frequency (f): The number of observations that fall within a specific class interval. The sum of all frequencies equals the total number of observations ($\sum f = N$ or $\sum f_i = N$).
Class Mark (Midpoint, $x_i$): The representative value for a class interval. It is calculated as the average of the lower and upper class limits (or boundaries) of the interval.
$\text{Class Mark} = \frac{\text{Lower Class Limit} + \text{Upper Class Limit}}{2}$
Class Size (Width, h or c): The difference between the upper and lower class boundaries of a class interval. For exclusive classes ($0-10, 10-20$), the width is simply the difference between the upper and lower limits ($10 - 0 = 10$). For inclusive classes ($0-9, 10-19$), the class boundaries are first adjusted by $0.5$ to make them continuous ($0-9$ becomes $-0.5$ to $9.5$ and $10-19$ becomes $9.5$ to $19.5$), giving a width of $9.5 - (-0.5) = 10$. Equivalently, for either type of class, the width is the difference between the lower limits (or upper limits) of consecutive classes, e.g. $10 - 0 = 10$.
Relative Frequency: The proportion of observations falling into a particular class interval. It is calculated as the frequency of the class divided by the total number of observations ($f_i / N$). The sum of relative frequencies is always 1.
Cumulative Frequency (cf): The running total of frequencies. The cumulative frequency for a class is the sum of the frequencies of that class and all preceding classes. It indicates the number of observations less than or equal to the upper boundary of the class interval.
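The bookkeeping described above (class intervals, frequencies, class marks, relative frequencies, and cumulative frequencies) can be illustrated with a short Python sketch. The raw data and the class width of 10 used here are assumed purely for demonstration.

```python
# Minimal sketch: build a grouped frequency distribution with exclusive classes.
# The raw data and the class width of 10 are assumed for illustration only.
raw = [12, 5, 27, 18, 9, 33, 21, 14, 38, 25, 7, 29, 16, 31, 22]
width = 10
classes = [(lo, lo + width) for lo in range(0, 40, width)]   # 0-10, 10-20, 20-30, 30-40

rows, cumulative = [], 0
for lo, hi in classes:
    # exclusive convention: lo <= x < hi, so 10 falls in the class 10-20
    f = sum(1 for x in raw if lo <= x < hi)
    cumulative += f
    mark = (lo + hi) / 2                       # class mark = midpoint of the interval
    rows.append((f"{lo}-{hi}", f, mark, f / len(raw), cumulative))

print("Class    f   x_i    rel. f   cf")
for interval, f, mark, rel, cf in rows:
    print(f"{interval:<8} {f:<3} {mark:<6} {rel:<8.2f} {cf}")
```

The frequency column sums to the number of observations, and the final cumulative frequency equals it as well, as required.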
Example 1. Construct an ungrouped frequency distribution table for the following marks obtained by 20 students in a test:
5 | 10 | 15 | 10 | 5 | 15 | 20 | 10 | 15 | 5 |
20 | 10 | 15 | 5 | 10 | 20 | 15 | 10 | 5 | 15 |
Answer:
The given data represents the raw marks of 20 students. To create an ungrouped frequency distribution, we list each distinct mark obtained and count how many times it appears in the dataset.
The distinct marks are 5, 10, 15, and 20.
Let's tally the occurrences of each mark:
- Mark 5: Occurs in the dataset 5 times. Tally: $\bcancel{||||}$
- Mark 10: Occurs in the dataset 6 times. Tally: $\bcancel{||||}\,|$
- Mark 15: Occurs in the dataset 6 times. Tally: $\bcancel{||||}\,|$
- Mark 20: Occurs in the dataset 3 times. Tally: $|||$
The sum of frequencies is $5 + 6 + 6 + 3 = 20$, which matches the total number of students.
The frequency distribution table is as follows:
Marks | Tally Marks | Frequency (f) |
---|---|---|
5 | $\bcancel{||||}$ | 5 |
10 | $\bcancel{||||}\,|$ | 6 |
15 | $\bcancel{||||}\,|$ | 6 |
20 | $|||$ | 3 |
Total | | 20 |
This table clearly shows the distribution of marks, indicating, for example, that the marks 10 and 15 were the most frequent scores among the students.
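The same tally can be reproduced programmatically. The sketch below (Python, using collections.Counter) counts the marks from Example 1 and should return the frequencies 5, 6, 6 and 3 obtained above.

```python
from collections import Counter

marks = [5, 10, 15, 10, 5, 15, 20, 10, 15, 5,
         20, 10, 15, 5, 10, 20, 15, 10, 5, 15]

freq = Counter(marks)                        # ungrouped frequency distribution
for value in sorted(freq):
    print(f"Mark {value}: frequency {freq[value]}")
print("Total:", sum(freq.values()))          # should print 20
```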
2. Graphical Presentation:
Graphs and charts provide a visual summary of the data, making it easier to identify patterns, shapes, and trends. Common graphical representations include:
Bar Graph: Used for categorical or discrete data. It consists of bars of equal width, with the height of each bar proportional to the frequency or value of the category/variable. Bars are typically separated by gaps.
Histogram: Used for continuous data grouped into class intervals. It is a set of adjacent rectangles (bars) where the base of each rectangle represents a class interval and the area is proportional to the frequency of that class. For equal class widths, the height of the bar is proportional to the frequency.
Frequency Polygon: Can be drawn by joining the midpoints of the tops of the rectangles in a histogram with line segments. It can also be drawn directly from a frequency distribution table by plotting class marks against frequencies and joining the points.
Ogive (Cumulative Frequency Curve): A graph that represents the cumulative frequencies. It is plotted by taking class boundaries on the x-axis and cumulative frequencies on the y-axis. Ogives are useful for finding quartiles, percentiles, etc.
Pie Chart: A circular chart divided into sectors, where each sector represents a proportion of the whole. The area of each sector is proportional to the frequency or relative frequency of the category it represents. Used primarily for categorical data.
Line Graph: Used to show trends over time or across ordered categories. Data points are plotted and connected by lines.
Scatter Plot: Used to display the relationship between two quantitative variables. Each point on the graph represents a pair of values for the two variables.
Interpretation of Presented Data
The final step in the descriptive process is interpreting the organised and presented data. This involves examining the summaries (tables, graphs, or numerical measures which we will discuss later) to understand the characteristics of the dataset. Key aspects of interpretation include:
Central Tendency: Identifying the typical or central value around which the data tends to cluster (e.g., what is the average score? What is the most common score?). This will be explored using measures like Mean, Median, and Mode.
Dispersion (Variability): Understanding how spread out the data is (e.g., are the scores tightly clustered or widely scattered?). This is assessed using measures like Range, Variance, and Standard Deviation.
Shape of Distribution: Observing the overall form of the distribution, including whether it is symmetrical or skewed (leaning to one side) and how peaked or flat it is. This is described using measures of Skewness and Kurtosis.
Outliers: Identifying any extreme values that lie far away from the rest of the data points.
Relationships: If multiple variables are involved, exploring potential relationships or associations between them (e.g., does height tend to increase with age?). This might involve looking at scatter plots or calculating correlation coefficients.
By combining these observations, one can provide a comprehensive description and interpretation of the dataset, drawing meaningful conclusions relevant to the context from which the data was collected. The subsequent sections will detail the methods for calculating the quantitative measures mentioned above.
Measure of Dispersion
While measures of central tendency like the mean, median, and mode provide a single value that represents the center or typical value of a dataset, they do not tell us anything about how the data is spread out or varies around that center. Two datasets can have the same mean or median but can be vastly different in terms of their variability.
For example, consider the marks of two students, A and B, in five tests out of 100:
Student A: 50, 50, 50, 50, 50 (Mean = 50)
Student B: 0, 25, 50, 75, 100 (Mean = 50)
Both students have the same average mark, but Student A's marks are consistently 50 (no variability), while Student B's marks are widely spread out. Measures of central tendency alone cannot capture this difference.
Measures of Dispersion (also known as measures of variability or spread) quantify the extent to which data points in a set differ from the average or from each other. They provide information about the homogeneity or heterogeneity of the data. A small measure of dispersion indicates that data points are clustered closely around the center, while a large measure of dispersion indicates that data points are spread out over a wider range.
Types of Measures of Dispersion
Measures of dispersion can be classified into two main types:
Absolute Measures of Dispersion: These measures are expressed in the same units as the original data. They indicate the amount of variation in the data.
Examples: Range, Quartile Deviation, Mean Deviation, Standard Deviation.
Relative Measures of Dispersion: These measures are ratios or percentages and are unitless. They are used to compare the variability of two or more datasets, especially when the datasets have different units or different means.
Examples: Coefficient of Range, Coefficient of Quartile Deviation, Coefficient of Mean Deviation, Coefficient of Variation.
Let's discuss the common absolute and relative measures.
Common Measures of Dispersion
1. Range:
The Range is the simplest measure of dispersion to calculate. It is defined as the difference between the maximum (largest) value and the minimum (smallest) value in a dataset.
$\text{Range} = \text{Maximum Value} - \text{Minimum Value}$
... (i)
Advantages:
- Easy to understand and calculate.
Disadvantages:
- It is highly sensitive to extreme values or outliers, as it only considers the two most extreme observations.
- It does not provide any information about the distribution of values between the minimum and maximum.
- It is not suitable for open-ended class intervals in grouped data.
Coefficient of Range (Relative Measure):
$\text{Coefficient of Range} = \frac{\text{Maximum Value} - \text{Minimum Value}}{\text{Maximum Value} + \text{Minimum Value}}$
This is a unitless measure used for comparison.
2. Quartile Deviation (Semi-Interquartile Range):
The Quartile Deviation (QD) is a measure of dispersion based on the first and third quartiles. Quartiles divide an ordered dataset into four equal parts.
First Quartile ($Q_1$): The value below which $25\%$ of the data lies (25th percentile).
Second Quartile ($Q_2$): The value below which $50\%$ of the data lies (50th percentile or Median).
Third Quartile ($Q_3$): The value below which $75\%$ of the data lies (75th percentile).
The difference between the third and first quartile ($Q_3 - Q_1$) is called the Interquartile Range (IQR). The Quartile Deviation is half of the Interquartile Range.
$\text{QD} = \frac{Q_3 - Q_1}{2}$
... (ii)
QD measures the average distance of the first and third quartiles from the median. It essentially gives the spread of the middle $50\%$ of the data.
How to Find Quartiles:
For Ungrouped Data: First, arrange the data in ascending order.
$Q_1$ is the value at the $\left(\frac{n+1}{4}\right)^{\text{th}}$ position.
$Q_2$ (Median) is the value at the $\left(\frac{n+1}{2}\right)^{\text{th}}$ position.
$Q_3$ is the value at the $\left(\frac{3(n+1)}{4}\right)^{\text{th}}$ position.
If the position is not an integer, interpolation is used (e.g., the 2.5th position is the average of the 2nd and 3rd values).
For Grouped Data: First, construct the cumulative frequency table. The quartile ($Q_k$) falls in the class interval where the cumulative frequency is just greater than $\frac{kN}{4}$ (for $k=1, 2, 3$). The formula for $Q_k$ is:
$Q_k = L + \frac{\frac{kN}{4} - \text{cf}_{k-1}}{f_k} \times h$
Where: $L$ is the lower boundary of the quartile class, $N$ is the total frequency, $\text{cf}_{k-1}$ is the cumulative frequency of the class preceding the quartile class, $f_k$ is the frequency of the quartile class, and $h$ is the class size of the quartile class.
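As an illustration of this formula, the sketch below locates the quartile class from the running cumulative frequency and then applies $Q_k = L + \frac{\frac{kN}{4} - \text{cf}_{k-1}}{f_k} \times h$. The frequency table used here is assumed for demonstration only.

```python
# Minimal sketch of Q_k for grouped data; the class intervals and frequencies are assumed.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]   # exclusive class intervals
freqs   = [4, 8, 6, 2]                               # frequencies f_i
N = sum(freqs)

def grouped_quartile(k):
    target = k * N / 4
    cf = 0                                   # cumulative frequency of preceding classes
    for (L, U), f in zip(classes, freqs):
        if cf + f >= target:                 # first class whose cumulative frequency reaches kN/4
            return L + (target - cf) / f * (U - L)
        cf += f
    return classes[-1][1]

print(grouped_quartile(1), grouped_quartile(2), grouped_quartile(3))   # Q1, Q2 (median), Q3
```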
Advantages:
- It is not affected by extreme values, as it only uses the data from the central $50\%$.
- Can be calculated for distributions with open-ended class intervals.
Disadvantages:
- It does not use all the data points, ignoring the lowest $25\%$ and highest $25\%$.
- It is not suitable for algebraic treatment.
Coefficient of Quartile Deviation (Relative Measure):
$\text{Coefficient of QD} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
This is a unitless measure.
3. Mean Deviation:
The Mean Deviation (MD) is the average of the absolute deviations of the data points from a measure of central tendency (usually the mean or median). Absolute values are used because the sum of deviations from the mean is always zero.
Mean Deviation from the Mean ($\overline{x}$):
For Ungrouped Data:
$\text{MD} (\overline{x}) = \frac{\sum_{i=1}^{n} |x_i - \overline{x}|}{n}$
... (iii)
Where $x_i$ are the data points, $\overline{x}$ is the mean, and $n$ is the number of data points.
For Grouped Data:
$\text{MD} (\overline{x}) = \frac{\sum_{i=1}^{k} f_i |m_i - \overline{x}|}{N}$
... (iv)
Where $m_i$ are the class marks, $f_i$ are the frequencies, $\overline{x}$ is the mean, $k$ is the number of classes, and $N = \sum f_i$ is the total frequency.
Mean Deviation from the Median (M):
For Ungrouped Data:
$\text{MD} (M) = \frac{\sum_{i=1}^{n} |x_i - M|}{n}$
For Grouped Data:
$\text{MD} (M) = \frac{\sum_{i=1}^{k} f_i |m_i - M|}{N}$
(Where $M$ is the median, calculated using appropriate formulas for ungrouped/grouped data). Mean deviation is minimal when calculated from the median.
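For ungrouped data, the two mean-deviation formulas reduce to a couple of lines. A minimal sketch (the sample values are assumed):

```python
from statistics import mean, median

data = [2, 4, 4, 5, 6, 8]                    # assumed sample values
xbar, med = mean(data), median(data)

md_mean   = sum(abs(x - xbar) for x in data) / len(data)   # MD about the mean
md_median = sum(abs(x - med)  for x in data) / len(data)   # MD about the median

print(md_mean, md_median)   # MD about the median is never larger than MD about the mean
```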
Advantages:
- Uses all the data points.
- Relatively easy to understand.
Disadvantages:
- The use of absolute values makes it less suitable for advanced mathematical and statistical operations (unlike variance and standard deviation).
- It is affected by extreme values (though less so than the Range).
Coefficient of Mean Deviation (Relative Measure):
$\text{Coefficient of MD} = \frac{\text{Mean Deviation}}{\text{Mean or Median}}$
This is calculated based on whether the mean deviation is computed from the mean or the median.
4. Variance and Standard Deviation:
These are the most important and widely used measures of dispersion. They are based on the squared deviations of data points from the mean, which removes the issue of the sum of deviations being zero and is mathematically more convenient than absolute values.
The Variance is defined as the average of the squared deviations of the data points from their mean.
The Standard Deviation (SD) is the positive square root of the variance. It is particularly useful because it is expressed in the same units as the original data, making it easier to interpret than the variance.
For Ungrouped Data:
Let $x_1, x_2, \dots, x_N$ be the $N$ observations of a population with population mean $\mu$.
$\text{Population Variance ($\sigma^2$)} = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
... (v)
$\text{Population Standard Deviation ($\sigma$)} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
... (vi)
Let $x_1, x_2, \dots, x_n$ be the $n$ observations of a sample with sample mean $\overline{x}$.
$\text{Sample Variance ($s^2$)} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{n-1}$
... (vii)
$\text{Sample Standard Deviation ($s$)} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{n-1}}$
... (viii)
Note on Sample Variance ($n-1$): We use $n-1$ in the denominator for sample variance and standard deviation (known as Bessel's correction) to get an unbiased estimate of the population variance. If we used $n$, the sample variance would tend to underestimate the true population variance.
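Python's statistics module implements both conventions: variance and stdev use the $n-1$ (sample) denominator, while pvariance and pstdev use $n$ (population). A small sketch with assumed data shows the difference:

```python
import statistics as st

data = [10, 12, 23, 23, 16, 23, 21, 16]      # assumed observations

print(st.variance(data), st.stdev(data))     # sample: denominator n - 1
print(st.pvariance(data), st.pstdev(data))   # population: denominator n
```

The sample figures are slightly larger, which is exactly the effect of Bessel's correction.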
Alternative Formula for Variance (Ungrouped Data):
The calculation of variance can sometimes be simplified using an alternative formula, which avoids calculating deviations from the mean individually.
$\sum_{i=1}^{n} (x_i - \overline{x})^2 = \sum (x_i^2 - 2x_i \overline{x} + \overline{x}^2)$
$ = \sum x_i^2 - \sum (2x_i \overline{x}) + \sum \overline{x}^2$
$ = \sum x_i^2 - 2\overline{x} \sum x_i + n\overline{x}^2$
(Since $\overline{x}$ is a constant for the summation)
We know that $\overline{x} = \frac{\sum x_i}{n}$, so $\sum x_i = n\overline{x}$.
$\sum (x_i - \overline{x})^2 = \sum x_i^2 - 2\overline{x}(n\overline{x}) + n\overline{x}^2$
$= \sum x_i^2 - 2n\overline{x}^2 + n\overline{x}^2$
$= \sum x_i^2 - n\overline{x}^2$
... (ix)
So, the sample variance can also be calculated as:
$\text{Sample Variance ($s^2$)} = \frac{\sum x_i^2 - n\overline{x}^2}{n-1}$
... (x)
Or, substituting $\overline{x} = \frac{\sum x_i}{n}$:
$\text{Sample Variance ($s^2$)} = \frac{\sum x_i^2 - n\left(\frac{\sum x_i}{n}\right)^2}{n-1} = \frac{\sum x_i^2 - n\frac{(\sum x_i)^2}{n^2}}{n-1}$
$= \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1} = \frac{\frac{n\sum x_i^2 - (\sum x_i)^2}{n}}{n-1}$
$\text{Sample Variance ($s^2$)} = \frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}$
... (xi)
Similarly, for the population variance:
$\text{Population Variance ($\sigma^2$)} = \frac{\sum x_i^2 - N\mu^2}{N}$
$= \frac{N\sum x_i^2 - (\sum x_i)^2}{N^2}$
For Grouped Data:
Let $m_i$ be the class marks, $f_i$ be the frequencies, $k$ be the number of classes, and $N = \sum f_i$ or $n=\sum f_i$ be the total frequency (for population and sample respectively).
$\text{Population Variance ($\sigma^2$)} = \frac{\sum_{i=1}^{k} f_i (m_i - \mu)^2}{N}$
... (xii)
$\text{Population Standard Deviation ($\sigma$)} = \sqrt{\frac{\sum_{i=1}^{k} f_i (m_i - \mu)^2}{N}}$
... (xiii)
$\text{Sample Variance ($s^2$)} = \frac{\sum_{i=1}^{k} f_i (m_i - \overline{x})^2}{n-1}$
... (xiv)
$\text{Sample Standard Deviation ($s$)} = \sqrt{\frac{\sum_{i=1}^{k} f_i (m_i - \overline{x})^2}{n-1}}$
... (xv)
Alternative formulas for grouped data variance:
$\sigma^2 = \frac{\sum f_i m_i^2 - N\mu^2}{N} = \frac{\sum f_i m_i^2}{N} - \mu^2$
$s^2 = \frac{\sum f_i m_i^2 - n\overline{x}^2}{n-1}$
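A short sketch of the grouped-data (population) variance using class marks; the frequency table is assumed for illustration, and the shortcut form is checked against the definitional form:

```python
from math import sqrt

marks = [5, 15, 25, 35]   # class marks m_i of the classes 0-10, 10-20, 20-30, 30-40 (assumed)
freqs = [4, 8, 6, 2]      # frequencies f_i (assumed)
N = sum(freqs)

mu = sum(f * m for f, m in zip(freqs, marks)) / N
var_definitional = sum(f * (m - mu) ** 2 for f, m in zip(freqs, marks)) / N
var_shortcut = sum(f * m * m for f, m in zip(freqs, marks)) / N - mu ** 2   # same value

print(mu, var_definitional, var_shortcut, sqrt(var_definitional))
```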
Advantages:
- Uses all the data points.
- Mathematically tractable, suitable for further statistical analysis (e.g., in inferential statistics).
- Standard deviation is in the same units as the original data, making it easy to interpret.
Disadvantages:
- Affected by extreme values due to squaring the deviations.
- More complex to calculate manually compared to Range or QD.
5. Coefficient of Variation (CV):
The Coefficient of Variation (CV) is a relative measure of dispersion that expresses the standard deviation as a percentage of the mean. It is unitless and is particularly useful for comparing the variability of two or more datasets that have different units or significantly different means.
$\text{CV} = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100\%$
... (xvi)
For sample data:
$\text{CV} = \frac{s}{\overline{x}} \times 100\%$
For population data:
$\text{CV} = \frac{\sigma}{\mu} \times 100\%$
A higher Coefficient of Variation indicates greater relative variability compared to the mean, while a lower CV indicates less relative variability (more consistency).
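A quick sketch comparing the consistency of two assumed datasets with the CV. Both samples below have the same mean, so only the CV (or the standard deviation) distinguishes them:

```python
import statistics as st

# Assumed runs scored by two batsmen over five matches (same mean, different spread).
batsman_a = [52, 48, 55, 50, 45]
batsman_b = [10, 80, 5, 95, 60]

for name, runs in (("A", batsman_a), ("B", batsman_b)):
    cv = st.stdev(runs) / st.mean(runs) * 100
    print(f"Batsman {name}: mean = {st.mean(runs)}, CV = {cv:.1f}%")
```

The lower CV identifies the more consistent batsman.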
Example 1. Find the range, variance, and standard deviation for the following sample data:
2 | 4 | 4 | 5 | 6 | 8 |
Answer:
The given sample data points are $x_i$: 2, 4, 4, 5, 6, 8.
The number of data points in the sample is $n=6$.
1. Range:
Maximum value = 8
Minimum value = 2
$\text{Range} = \text{Maximum Value} - \text{Minimum Value} = 8 - 2 = 6$
The range of the data is 6.
2. Variance ($s^2$) and Standard Deviation ($s$):
Since this is sample data, we will use the formulas for sample variance and standard deviation (using $n-1$ in the denominator). First, we need to calculate the sample mean ($\overline{x}$).
$\overline{x} = \frac{\sum x_i}{n} = \frac{2 + 4 + 4 + 5 + 6 + 8}{6}$
$\overline{x} = \frac{29}{6}$
[Sample Mean]
Now, we calculate the squared deviations from the mean, $(x_i - \overline{x})^2$, for each data point. It's often easier to work with the exact fraction $\overline{x} = \frac{29}{6}$ or use the alternative formula to avoid rounding errors until the final step. Let's use the deviation method first.
$x_i - \overline{x}$:
- $2 - \frac{29}{6} = \frac{12-29}{6} = -\frac{17}{6}$
- $4 - \frac{29}{6} = \frac{24-29}{6} = -\frac{5}{6}$
- $4 - \frac{29}{6} = \frac{24-29}{6} = -\frac{5}{6}$
- $5 - \frac{29}{6} = \frac{30-29}{6} = \frac{1}{6}$
- $6 - \frac{29}{6} = \frac{36-29}{6} = \frac{7}{6}$
- $8 - \frac{29}{6} = \frac{48-29}{6} = \frac{19}{6}$
$(x_i - \overline{x})^2$:
- $(-\frac{17}{6})^2 = \frac{289}{36}$
- $(-\frac{5}{6})^2 = \frac{25}{36}$
- $(-\frac{5}{6})^2 = \frac{25}{36}$
- $(\frac{1}{6})^2 = \frac{1}{36}$
- $(\frac{7}{6})^2 = \frac{49}{36}$
- $(\frac{19}{6})^2 = \frac{361}{36}$
Sum of squared deviations: $\sum (x_i - \overline{x})^2 = \frac{289}{36} + \frac{25}{36} + \frac{25}{36} + \frac{1}{36} + \frac{49}{36} + \frac{361}{36}$
$\sum (x_i - \overline{x})^2 = \frac{289 + 25 + 25 + 1 + 49 + 361}{36} = \frac{750}{36}$
Now calculate the sample variance $s^2$ using Formula (vii):
$\text{s}^2 = \frac{\sum (x_i - \overline{x})^2}{n-1} = \frac{750/36}{6-1} = \frac{750/36}{5}$
$\text{s}^2 = \frac{750}{36 \times 5} = \frac{750}{180}$
$\text{s}^2 = \frac{\cancel{750}^{25}}{\cancel{180}_{6}} = \frac{25}{6}$
[Sample Variance]
The sample standard deviation $s$ is the square root of the variance (Formula (viii)):
$\text{s} = \sqrt{\frac{25}{6}} = \frac{\sqrt{25}}{\sqrt{6}} = \frac{5}{\sqrt{6}}$
[Sample Standard Deviation]
To get a numerical value, we can approximate $\sqrt{6} \approx 2.449$.
$\text{s} \approx \frac{5}{2.449} \approx 2.041$
Alternatively (using the shortcut formula):
Calculate $\sum x_i^2$:
$\sum x_i^2 = 2^2 + 4^2 + 4^2 + 5^2 + 6^2 + 8^2$
$= 4 + 16 + 16 + 25 + 36 + 64 = 161$
We already found $\sum x_i = 29$. $n=6$.
Using the alternative formula (xi) for sample variance:
$\text{s}^2 = \frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}$
$\text{s}^2 = \frac{6(161) - (29)^2}{6(6-1)}$
$\text{s}^2 = \frac{966 - 841}{6(5)} = \frac{125}{30}$
$\text{s}^2 = \frac{\cancel{125}^{25}}{\cancel{30}_{6}} = \frac{25}{6}$
[Sample Variance]
The variance is $\frac{25}{6}$. The standard deviation is $\text{s} = \sqrt{\frac{25}{6}} = \frac{5}{\sqrt{6}} \approx 2.041$.
Summary of Results:
- Range = 6
- Sample Variance ($s^2$) = $\frac{25}{6}$ or approximately 4.17
- Sample Standard Deviation ($s$) = $\frac{5}{\sqrt{6}}$ or approximately 2.04
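These results can be cross-checked with Python's statistics module, which also uses the $n-1$ denominator for a sample:

```python
import statistics as st

data = [2, 4, 4, 5, 6, 8]
print(max(data) - min(data))   # Range = 6
print(st.variance(data))       # 25/6 ≈ 4.1667
print(st.stdev(data))          # 5/√6 ≈ 2.0412
```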
Skewness and Kurtosis
In addition to describing the central tendency and dispersion of a dataset, it is also important to understand the shape of its distribution. Measures of central tendency (like mean, median, mode) tell us where the center lies, and measures of dispersion (like variance, standard deviation) tell us about the spread. However, these measures alone do not reveal whether the distribution is symmetric or asymmetric, or how peaked or flat it is. Skewness and Kurtosis are statistical measures that provide insight into these characteristics of a distribution's shape.
Skewness
Skewness is a measure of the asymmetry in a probability distribution. It indicates the degree to which the distribution is tilted or skewed towards one side relative to its mean. A symmetric distribution has a skewness of zero, while a skewed distribution will have either positive or negative skewness.
Understanding Skewness and its Effect on Measures of Central Tendency:
Symmetric Distribution: In a perfectly symmetric distribution, the data is distributed evenly around the center. The shape on the left side of the central point is a mirror image of the shape on the right side.
For a symmetric distribution, the mean, median, and mode are all equal:
Mean = Median = Mode
The skewness value for a symmetric distribution is 0. A classic example is the Normal Distribution (bell curve).
Positive Skewness (Right Skew): A distribution is said to be positively skewed if the tail on the right side of the distribution is longer or extends further than the tail on the left side. This indicates that there are some unusually large values (outliers) that pull the mean towards the right.
In a positively skewed distribution, the relationship between the measures of central tendency is typically:
Mean > Median > Mode
The peak of the distribution (mode) is to the left, followed by the median, and the mean is the largest because it is affected by the high values in the right tail.
Negative Skewness (Left Skew): A distribution is said to be negatively skewed if the tail on the left side of the distribution is longer or extends further than the tail on the right side. This indicates that there are some unusually small values that pull the mean towards the left.
In a negatively skewed distribution, the relationship between the measures of central tendency is typically:
Mean < Median < Mode
The peak of the distribution (mode) is to the right, followed by the median, and the mean is the smallest because it is affected by the low values in the left tail.
Visual Representation of Skewness:
Measures of Skewness:
Various coefficients have been developed to quantify the degree of skewness. These measures provide a numerical value indicating both the direction and magnitude of asymmetry.
1. Pearson's First Coefficient of Skewness (Based on Mode):
This coefficient is calculated using the difference between the mean and the mode, divided by the standard deviation.
$S_{P1} = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$
... (i)
Where:
- Mean ($\overline{x}$ or $\mu$) is the arithmetic average.
- Mode is the most frequent value.
- Standard Deviation ($s$ or $\sigma$) is the measure of spread.
Interpretation:
- If $S_{P1} = 0$, the distribution is symmetric.
- If $S_{P1} > 0$, the distribution is positively skewed.
- If $S_{P1} < 0$, the distribution is negatively skewed.
This measure is useful when the mode is clearly defined and represents the peak of the distribution.
2. Pearson's Second Coefficient of Skewness (Based on Median):
When the mode is not well-defined or the distribution is multimodal, the relationship between the mean, median, and mode suggests another measure based on the median. For moderately skewed distributions, the mean, median, and mode follow the empirical relationship: Mean - Mode $\approx$ 3 (Mean - Median).
Using this relationship, Pearson's second coefficient of skewness is defined as:
$S_{P2} = \frac{3 (\text{Mean} - \text{Median})}{\text{Standard Deviation}}$
... (ii)
Where:
- Mean ($\overline{x}$ or $\mu$) is the arithmetic average.
- Median is the middle value when data is ordered.
- Standard Deviation ($s$ or $\sigma$) is the measure of spread.
Interpretation is similar to Pearson's first coefficient:
- If $S_{P2} = 0$, the distribution is symmetric.
- If $S_{P2} > 0$, the distribution is positively skewed.
- If $S_{P2} < 0$, the distribution is negatively skewed.
These coefficients are unitless measures, allowing for comparison of skewness across different datasets.
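Both coefficients are one-line calculations once the summary measures are known. A minimal sketch with assumed summary values (the second call uses the figures of Example 1 below):

```python
def pearson_skew_mode(mean, mode, sd):
    # Pearson's first coefficient of skewness (based on the mode)
    return (mean - mode) / sd

def pearson_skew_median(mean, median, sd):
    # Pearson's second coefficient of skewness (based on the median)
    return 3 * (mean - median) / sd

print(pearson_skew_mode(50, 45, 10))     # 0.5 -> positively skewed (mode of 45 is assumed)
print(pearson_skew_median(50, 48, 10))   # 0.6 -> positively skewed
```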
Kurtosis
Kurtosis is a measure that describes the shape of the tails and the peakedness of a distribution relative to a normal distribution. It indicates how much of the data is concentrated in the tails versus in the center and intermediate regions of the distribution.
Kurtosis is technically defined based on the fourth standardized moment of the distribution. The kurtosis of a normal distribution is 3. To make the normal distribution a benchmark with a kurtosis value of 0, Excess Kurtosis is often used, calculated as Kurtosis - 3.
For a sample, the sample kurtosis is calculated using the fourth sample moment about the mean. The formula is complex and typically covered in higher statistics courses, but the interpretation based on excess kurtosis is important for understanding the shape:
$\text{Excess Kurtosis} = \left( \frac{\frac{1}{n} \sum (x_i - \overline{x})^4}{s^4} \right) - 3$
... (iii)
(Using sample standard deviation $s$ calculated with $n$ in the denominator for consistency with the moment definition, or adjusting if using $n-1$).
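A minimal sketch of the excess-kurtosis calculation, using the moment ($n$-denominator) standard deviation as noted above; the data are assumed for illustration:

```python
data = [2, 4, 4, 5, 6, 8]                        # assumed sample
n = len(data)
xbar = sum(data) / n

m2 = sum((x - xbar) ** 2 for x in data) / n      # second central moment (variance with n)
m4 = sum((x - xbar) ** 4 for x in data) / n      # fourth central moment
excess_kurtosis = m4 / m2 ** 2 - 3

print(excess_kurtosis)   # > 0: heavier tails than normal; < 0: lighter tails
```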
Types of Kurtosis (Based on Excess Kurtosis):
Mesokurtic: A distribution that has the same kurtosis as the normal distribution. Its peak and tails are neither too high/fat nor too low/thin compared to a normal curve.
Excess Kurtosis = 0. The normal distribution is the most well-known mesokurtic distribution.
Leptokurtic: A distribution that has a sharper peak and fatter (heavier) tails than a normal distribution. This implies that more of the data is concentrated around the mean and in the extreme tails, with less data in the "shoulders" (intermediate regions).
Excess Kurtosis > 0. Such distributions indicate a higher probability of extreme events or outliers compared to a normal distribution.
Platykurtic: A distribution that has a flatter peak and thinner (lighter) tails than a normal distribution. This implies that the data is spread out more evenly across the range, with fewer values concentrated near the mean or in the extreme tails.
Excess Kurtosis < 0. Such distributions indicate a lower probability of extreme events or outliers compared to a normal distribution.
Visual Representation of Kurtosis:
Interpretation of Kurtosis:
Kurtosis primarily informs us about the tails of the distribution. Higher kurtosis means fatter tails and thus a greater chance of observing extreme values (outliers). Lower kurtosis means thinner tails and a lower chance of extreme values. While it also affects the peak, the effect on tails is considered the more significant aspect.
Example 1. For a dataset, the Mean is 50, the Median is 48, and the Standard Deviation is 10. Calculate Pearson's second coefficient of skewness and interpret the result.
Answer:
Given:
- Mean ($\overline{x}$) = 50
- Median (M) = 48
- Standard Deviation ($s$) = 10
We need to calculate Pearson's second coefficient of skewness, $S_{P2}$. The formula (ii) is:
$S_{P2} = \frac{3 (\text{Mean} - \text{Median})}{\text{Standard Deviation}}$
[Pearson's second coefficient]
Substitute the given values into the formula:
$S_{P2} = \frac{3 (50 - 48)}{10}$
$S_{P2} = \frac{3 (2)}{10}$
$S_{P2} = \frac{6}{10}$
$S_{P2} = 0.6$
[Coefficient of Skewness]
Interpretation:
The calculated value of Pearson's second coefficient of skewness is $0.6$.
Since the value ($0.6$) is positive and greater than 0, this indicates that the distribution of the dataset is positively skewed (right-skewed). This means the distribution has a longer tail on the right side, and the majority of the data values are concentrated towards the left side of the distribution. The Mean (50) is greater than the Median (48), which is consistent with positive skewness.
Percentile Rank and Quartile Rank
In statistics, it is often useful to understand the position of a particular value within a dataset relative to other values. Measures of position or location help us to determine where a specific observation stands in relation to the entire set of data, after the data has been ordered. Percentiles and Quartiles are important measures of position. They are values that divide a dataset into specific proportions.
Percentiles
Percentiles are values that divide a dataset that has been ordered from smallest to largest into 100 equal parts. There are 99 percentiles, denoted by $P_1, P_2, \dots, P_{99}$.
The $k$-th percentile ($P_k$) is a value in the dataset (or a value interpolated between two dataset values) such that approximately $k\%$ of the data values are less than or equal to $P_k$, and approximately $(100-k)\%$ of the data values are greater than or equal to $P_k$. For example, the 10th percentile ($P_{10}$) is the value below which approximately 10% of the data falls, and the 90th percentile ($P_{90}$) is the value below which approximately 90% of the data falls.
Percentiles are widely used, for instance, in reporting scores on standardized tests (e.g., a score in the 80th percentile means the student scored as well as or better than 80% of the test-takers) or in health statistics (e.g., growth charts for children).
Method for Finding the k-th Percentile ($P_k$) for Ungrouped Data:
Given a dataset of $n$ observations, follow these steps to find the $k$-th percentile ($P_k$):
Step 1: Arrange the data in ascending order (from smallest to largest). Let the ordered data points be $x_{(1)}, x_{(2)}, \dots, x_{(n)}$.
Step 2: Calculate the index or position of the $k$-th percentile using the formula:
$\text{Index } L = \frac{k}{100} \times n$
... (i)
Where $k$ is the desired percentile (e.g., 70 for the 70th percentile) and $n$ is the total number of data points.
Step 3: Determine the $k$-th percentile based on the calculated index $L$:
If $L$ is a whole number: The $k$-th percentile ($P_k$) is the average (mean) of the value at the $L$-th position and the value at the $(L+1)$-th position in the ordered dataset.
$\text{P}_k = \frac{x_{(L)} + x_{(L+1)}}{2}$
If $L$ is not a whole number: Round $L$ up to the nearest whole number. Let's call this rounded-up index $L'$. The $k$-th percentile ($P_k$) is the value at this $L'$-th position in the ordered dataset.
$\text{P}_k = x_{(L')}$ where $L' = \lceil L \rceil$
Percentile Rank
The Percentile Rank of a specific value $x$ in a dataset indicates the percentage of values in the dataset that are less than or equal to $x$. It tells us the standing of a particular score or value within the distribution.
To find the percentile rank of a value $x$ in a dataset of $n$ ordered values:
Step 1: Count the number of data points in the dataset that are less than or equal to $x$. Let this count be $C$.
Step 2: Calculate the percentile rank using the formula:
$\text{Percentile Rank of } x = \frac{\text{Number of values } \leq x}{\text{Total number of values}} \times 100$
... (ii)
Or, using the count $C$:
$\text{Percentile Rank of } x = \frac{C}{n} \times 100$
The result is expressed as a percentage.
Example 1. Consider the following test scores of 10 students:
60 | 65 | 70 | 75 | 80 | 85 | 90 | 92 | 95 | 100 |
Find the 70th percentile ($P_{70}$) and the percentile rank of a score of 85.
Answer:
The given data represents the test scores of 10 students. The data is already arranged in ascending order: 60, 65, 70, 75, 80, 85, 90, 92, 95, 100.
The total number of data points is $n = 10$.
Finding the 70th Percentile ($P_{70}$):
We want to find the 70th percentile, so $k = 70$.
Using the formula for the index $L$ (Formula (i)):
$\text{Index } L = \frac{k}{100} \times n = \frac{70}{100} \times 10$
$\text{Index } L = \frac{7}{10} \times 10 = 7$
[Calculated Index]
Since the index $L = 7$ is a whole number, the 70th percentile is the average of the value at the 7th position and the value at the $(7+1) = 8$-th position in the ordered dataset.
Looking at the ordered data:
1st: 60, 2nd: 65, 3rd: 70, 4th: 75, 5th: 80, 6th: 85, 7th: 90, 8th: 92, 9th: 95, 10th: 100.
The value at the 7th position is 90.
The value at the 8th position is 92.
The 70th percentile is the average of these two values:
$\text{P}_{70} = \frac{90 + 92}{2} = \frac{182}{2} = 91$
[70th Percentile]
The 70th percentile of the test scores is 91.
Finding the Percentile Rank of a score of 85:
We want to find the percentile rank of the value $x = 85$.
Using the formula for percentile rank (Formula (ii)), we first need to count the number of data points less than or equal to 85.
Looking at the ordered data: 60, 65, 70, 75, 80, 85, 90, 92, 95, 100.
The values less than or equal to 85 are: 60, 65, 70, 75, 80, 85.
The number of values less than or equal to 85 is $C = 6$.
The total number of values is $n = 10$.
Calculate the percentile rank:
$\text{Percentile Rank of } 85 = \frac{C}{n} \times 100 = \frac{6}{10} \times 100$
$= 0.6 \times 100 = 60$
$\text{Percentile Rank of } 85 = 60\%$
[Percentile Rank]
The percentile rank of a score of 85 is 60. This means that $60\%$ of the students scored 85 or less on the test.
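Both calculations in this example follow directly from the step-by-step rules given earlier; a short Python sketch reproduces them:

```python
import math

scores = [60, 65, 70, 75, 80, 85, 90, 92, 95, 100]   # already in ascending order

def percentile(data, k):
    # Rule from the text: L = (k/100)*n; average two values if L is whole, else round up.
    L = k / 100 * len(data)
    if L == int(L):
        L = int(L)
        return (data[L - 1] + data[L]) / 2
    return data[math.ceil(L) - 1]

def percentile_rank(data, x):
    return sum(1 for v in data if v <= x) / len(data) * 100

print(percentile(scores, 70))        # 91.0
print(percentile_rank(scores, 85))   # 60.0
```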
Quartiles
Quartiles are specific percentiles that divide an ordered dataset into four equal parts, each containing approximately 25% of the data. They are particularly useful for summarizing the spread and central location of the middle half of the data.
There are three main quartiles:
First Quartile ($Q_1$): This is the value below which approximately $25\%$ of the data falls. It is the same as the 25th percentile ($P_{25}$).
Second Quartile ($Q_2$): This is the value below which approximately $50\%$ of the data falls. It is the same as the 50th percentile ($P_{50}$) and is also the Median of the dataset.
Third Quartile ($Q_3$): This is the value below which approximately $75\%$ of the data falls. It is the same as the 75th percentile ($P_{75}$).
The data is divided into four parts by the quartiles:
- The first quarter of the data lies between the minimum value and $Q_1$.
- The second quarter lies between $Q_1$ and $Q_2$.
- The third quarter lies between $Q_2$ and $Q_3$.
- The fourth quarter lies between $Q_3$ and the maximum value.
The range between the first and third quartiles ($Q_3 - Q_1$) is called the Interquartile Range (IQR), which is a measure of dispersion discussed earlier.
Method for Finding Quartiles for Ungrouped Data:
The method for finding quartiles is a direct application of the method for finding percentiles, using $k=25, 50,$ and $75$.
Step 1: Arrange the data in ascending order (from smallest to largest).
Step 2: Calculate the index (position) for each quartile:
For the First Quartile ($Q_1$, $k=25$): $\text{Index } L_1 = \frac{25}{100} \times n = \frac{n}{4}$
For the Second Quartile ($Q_2$, $k=50$): $\text{Index } L_2 = \frac{50}{100} \times n = \frac{n}{2}$
For the Third Quartile ($Q_3$, $k=75$): $\text{Index } L_3 = \frac{75}{100} \times n = \frac{3n}{4}$
Where $n$ is the total number of data points.
Step 3: Determine the quartile value based on the calculated index $L$ (same interpretation rule as for percentiles):
If $L$ is a whole number: The quartile value is the average of the value at the $L$-th position and the value at the $(L+1)$-th position in the ordered dataset.
If $L$ is not a whole number: Round $L$ up to the nearest whole number ($L' = \lceil L \rceil$). The quartile value is the value at the $L'$-th position in the ordered dataset.
Note: Different statistical software packages and textbooks may use slightly different methods for calculating percentiles and quartiles, especially when the index is not a whole number. The method described here is one common approach.
Quartile Rank
The term Quartile Rank is less formally defined compared to percentile rank and is often used to broadly categorize where a particular value falls within the dataset in terms of quartiles. For a given value $x$, its quartile rank might refer to:
- Being in the first quartile (below $Q_1$).
- Being in the second quartile (between $Q_1$ and $Q_2$).
- Being in the third quartile (between $Q_2$ and $Q_3$).
- Being in the fourth quartile (above $Q_3$).
Essentially, if you calculate the percentile rank of a value $x$, its quartile rank is implicitly determined:
- If Percentile Rank of $x \leq 25\%$, it's in the first quartile (or lower).
- If $25\% < \text{Percentile Rank of } x \leq 50\%$, it's in the second quartile.
- If $50\% < \text{Percentile Rank of } x \leq 75\%$, it's in the third quartile.
- If Percentile Rank of $x > 75\%$, it's in the fourth quartile (or upper).
So, quartile rank is more of a qualitative descriptor of a value's position relative to the quartiles rather than a specific calculation like percentile rank.
Example 2. Using the same data from Example 1: 60, 65, 70, 75, 80, 85, 90, 92, 95, 100. Find the first quartile ($Q_1$) and the third quartile ($Q_3$).
Answer:
The ordered data is: 60, 65, 70, 75, 80, 85, 90, 92, 95, 100.
The total number of data points is $n = 10$.
Finding the First Quartile ($Q_1$):
$Q_1$ is the 25th percentile ($k=25$). The index $L_1$ is:
$\text{Index } L_1 = \frac{n}{4} = \frac{10}{4} = 2.5$
[Calculated Index for Q1]
Since the index $L_1 = 2.5$ is not a whole number, we round it up to the nearest whole number, which is 3. The first quartile $Q_1$ is the value at the 3rd position in the ordered dataset.
Looking at the ordered data: 1st: 60, 2nd: 65, 3rd: 70, 4th: 75, ...
The value at the 3rd position is 70.
$\text{Q}_1 = 70$
[First Quartile]
The first quartile is 70.
Finding the Third Quartile ($Q_3$):
$Q_3$ is the 75th percentile ($k=75$). The index $L_3$ is:
$\text{Index } L_3 = \frac{3n}{4} = \frac{3 \times 10}{4} = \frac{30}{4} = 7.5$
[Calculated Index for Q3]
Since the index $L_3 = 7.5$ is not a whole number, we round it up to the nearest whole number, which is 8. The third quartile $Q_3$ is the value at the 8th position in the ordered dataset.
Looking at the ordered data: ... 5th: 80, 6th: 85, 7th: 90, 8th: 92, 9th: 95, 10th: 100.
The value at the 8th position is 92.
$\text{Q}_3 = 92$
[Third Quartile]
The third quartile is 92.
(For completeness, the Median ($Q_2$) index $L_2 = n/2 = 10/2 = 5$. Since it's a whole number, $Q_2$ is the average of the 5th and 6th values: $\frac{80+85}{2} = \frac{165}{2} = 82.5$).
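Applying the same positional rule with $k = 25, 50$ and $75$ reproduces these quartiles. A short sketch (keeping in mind the note above that other software may use slightly different conventions):

```python
import math

scores = [60, 65, 70, 75, 80, 85, 90, 92, 95, 100]

def quartile(data, k):                       # k = 1, 2, 3
    L = k * len(data) / 4                    # index from the text's rule
    if L == int(L):
        L = int(L)
        return (data[L - 1] + data[L]) / 2   # average of L-th and (L+1)-th values
    return data[math.ceil(L) - 1]            # round up and take that value

print(quartile(scores, 1))   # Q1 = 70
print(quartile(scores, 2))   # Q2 = 82.5 (median)
print(quartile(scores, 3))   # Q3 = 92
```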
Correlation
In statistics, when we analyze the relationship between two quantitative variables, we often want to know if they tend to change together and, if so, in what way and how strongly. Correlation is a statistical measure that quantifies the strength and direction of a linear relationship between two such variables. It is a value that ranges from -1 to +1.
Understanding correlation helps us describe the association between variables. For example, is there a relationship between the amount of fertilizer used and crop yield? Is there a relationship between study time and exam scores? Correlation provides a numerical summary of such associations.
Types of Correlation
Based on the direction of the linear relationship between two variables, correlation can be classified into three main types:
1. Positive Correlation:
Two variables are said to be positively correlated if they tend to increase or decrease together. This means that as the values of one variable increase, the values of the other variable also tend to increase, and vice versa. On a scatter plot, the points generally form an upward-sloping pattern from left to right.
Example: Height and weight of adults (generally, taller people tend to be heavier). Hours studied and exam scores (more hours studied often lead to higher scores).
2. Negative Correlation (Inverse Correlation):
Two variables are said to be negatively correlated if they tend to move in opposite directions. This means that as the values of one variable increase, the values of the other variable tend to decrease, and vice versa. On a scatter plot, the points generally form a downward-sloping pattern from left to right.
Example: Price of a commodity and its demand (as price increases, demand generally decreases). Number of hours of exercise and body weight (more exercise often correlates with lower body weight).
3. Zero Correlation (No Correlation):
There is zero correlation between two variables if there is no linear relationship between them. Changes in one variable do not show a consistent linear pattern with changes in the other variable. On a scatter plot, the points are scattered randomly with no discernible upward or downward linear trend.
Example: A student's shoe size and their score on a history test. The price of rice in India and the amount of rainfall in the USA.
It's important to note that zero correlation only implies the absence of a linear relationship. There might still be a non-linear relationship between the variables.
Measuring Correlation
The strength and direction of a linear relationship can be assessed using visual tools like scatter diagrams and quantified using correlation coefficients.
1. Scatter Diagram:
A Scatter Diagram (or scatter plot) is a graphical representation of the relationship between two quantitative variables. For each pair of observations $(x_i, y_i)$, a point is plotted on a Cartesian coordinate system where the x-axis represents one variable and the y-axis represents the other.
By examining the pattern of the points in a scatter diagram, we can get an initial idea about the type and strength of the relationship:
Direction: If points generally trend upwards, it suggests positive correlation. If they trend downwards, it suggests negative correlation. If there is no clear trend, it suggests zero correlation.
Strength: If the points are tightly clustered around a straight line, the linear relationship is strong. If the points are widely scattered, the linear relationship is weak.
Visual Examples of Scatter Diagrams:
Scatter diagrams are useful for a visual assessment but do not provide a precise numerical measure of correlation.
2. Karl Pearson's Coefficient of Correlation ($r$):
Karl Pearson's Coefficient of Correlation, often denoted by $r$ (for a sample) or $\rho$ (rho) (for a population), is the most widely used numerical measure of the linear correlation between two variables. It quantifies how closely the points in a scatter plot cluster around a straight line.
The value of $r$ always lies between -1 and +1, inclusive:
$r = +1$: Indicates a perfect positive linear correlation. All points lie exactly on a straight line with a positive slope.
$r = -1$: Indicates a perfect negative linear correlation. All points lie exactly on a straight line with a negative slope.
$r = 0$: Indicates no linear correlation. There is no tendency for the variables to increase or decrease together in a linear fashion.
$0 < r < +1$: Indicates a positive linear correlation. Values closer to +1 suggest a stronger positive linear relationship.
$-1 < r < 0$: Indicates a negative linear correlation. Values closer to -1 suggest a stronger negative linear relationship.
Formula for Pearson's Correlation Coefficient (for a Sample):
Given $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, the definitional formula for the sample correlation coefficient $r_{xy}$ between variables $X$ and $Y$ is:
$\text{r}_{xy} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2 \sum_{i=1}^{n} (y_i - \overline{y})^2}}$
... (i)
Where $\overline{x}$ and $\overline{y}$ are the sample means of $X$ and $Y$, respectively.
- The numerator, $\sum (x_i - \overline{x})(y_i - \overline{y})$, is the sum of the products of the deviations from the mean, also known as the covariance (though not divided by $n-1$ or $n$). It indicates whether $x_i$ and $y_i$ tend to be simultaneously above or below their means.
- The denominator normalizes this sum by dividing by the product of the standard deviations (implicitly, as $\sqrt{\sum (x_i - \overline{x})^2 \sum (y_i - \overline{y})^2} = \sqrt{(n-1)s_x^2 \cdot (n-1)s_y^2} = (n-1)s_x s_y$), ensuring that $r$ is between -1 and +1.
Alternative Computational Formula for Pearson's $r$:
The definitional formula can be tedious to use directly, especially with large datasets or non-integer means. An algebraically equivalent formula that is easier for computation is:
$\text{r}_{xy} = \frac{n \sum_{i=1}^{n} x_i y_i - (\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{\sqrt{\left[n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$
... (ii)
Derivation of the Computational Formula from the Definitional Formula:
Let $X_i = x_i - \overline{x}$ and $Y_i = y_i - \overline{y}$.
The numerator of formula (i) is $\sum X_i Y_i = \sum (x_i - \overline{x})(y_i - \overline{y})$.
$\sum (x_i - \overline{x})(y_i - \overline{y}) = \sum (x_i y_i - x_i \overline{y} - \overline{x} y_i + \overline{x} \overline{y})$
$= \sum x_i y_i - \sum x_i \overline{y} - \sum \overline{x} y_i + \sum \overline{x} \overline{y}$
$= \sum x_i y_i - \overline{y} \sum x_i - \overline{x} \sum y_i + n \overline{x} \overline{y}$
[Since $\overline{x}, \overline{y}$ are constants]
Substitute $\sum x_i = n\overline{x}$ and $\sum y_i = n\overline{y}$:
$= \sum x_i y_i - \overline{y} (n\overline{x}) - \overline{x} (n\overline{y}) + n \overline{x} \overline{y}$
$= \sum x_i y_i - n\overline{x}\overline{y} - n\overline{x}\overline{y} + n\overline{x}\overline{y}$
$= \sum x_i y_i - n\overline{x}\overline{y}$
Substitute $\overline{x} = \frac{\sum x_i}{n}$ and $\overline{y} = \frac{\sum y_i}{n}$:
$= \sum x_i y_i - n\left(\frac{\sum x_i}{n}\right)\left(\frac{\sum y_i}{n}\right)$
$= \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$
$= \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n}$
[Numerator]
The terms in the denominator of formula (i) are the sums of squared deviations, which we know from the derivation of variance (Measure of Dispersion section):
$\sum (x_i - \overline{x})^2 = \sum x_i^2 - n\overline{x}^2 = \sum x_i^2 - n\left(\frac{\sum x_i}{n}\right)^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = \frac{n \sum x_i^2 - (\sum x_i)^2}{n}$
[Sum of squared deviations for X]
$\sum (y_i - \overline{y})^2 = \sum y_i^2 - n\overline{y}^2 = \sum y_i^2 - n\left(\frac{\sum y_i}{n}\right)^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = \frac{n \sum y_i^2 - (\sum y_i)^2}{n}$
[Sum of squared deviations for Y]
Substituting these into the denominator of formula (i):
$\sqrt{\frac{n \sum x_i^2 - (\sum x_i)^2}{n} \times \frac{n \sum y_i^2 - (\sum y_i)^2}{n}}$
$= \sqrt{\frac{\left[n \sum x_i^2 - (\sum x_i)^2\right]\left[n \sum y_i^2 - (\sum y_i)^2\right]}{n^2}}$
$= \frac{1}{n} \sqrt{\left[n \sum x_i^2 - (\sum x_i)^2\right]\left[n \sum y_i^2 - (\sum y_i)^2\right]}$
[Denominator]
Now, combining the numerator and denominator:
$\text{r}_{xy} = \frac{\frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n}}{\frac{1}{n} \sqrt{\left[n \sum x_i^2 - (\sum x_i)^2\right]\left[n \sum y_i^2 - (\sum y_i)^2\right]}}$
$\text{r}_{xy} = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{\left[n \sum x_i^2 - (\sum x_i)^2\right]\left[n \sum y_i^2 - (\sum y_i)^2\right]}}$
This derivation shows the equivalence of the two formulas for Pearson's $r$. The computational formula is generally preferred for manual calculations because it works with raw sums throughout and postpones division (and any resulting decimals) to the very last step.
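For comparison, the computational formula (ii) uses only the raw sums $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$ and $\sum y^2$. The Python sketch below (illustrative only; the function name is hypothetical) implements it and, for the same hypothetical data as before, returns the same value as the definitional version, mirroring the derivation above.

```python
# Illustrative sketch of formula (ii): Pearson's r from raw sums only.
from math import sqrt

def pearson_r_computational(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi ** 2 for xi in x)
    syy = sum(yi ** 2 for yi in y)
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

# Same hypothetical data as before: both formulas agree (approx. 0.775).
print(pearson_r_computational([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))
```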
Assumptions of Pearson's r:
- The two variables are quantitative (interval or ratio scale).
- The relationship between the variables is linear.
- The data is approximately normally distributed (though $r$ can be calculated for non-normal data, inference based on it might require normality).
- There are no significant outliers, as $r$ is sensitive to extreme values.
3. Spearman's Rank Correlation Coefficient ($\rho$ or $r_s$):
Spearman's Rank Correlation Coefficient, denoted by $\rho$ (rho) or $r_s$, is a non-parametric measure of the strength and direction of the monotonic relationship between two variables. A monotonic relationship is one where the variables tend to move in the same relative direction, but not necessarily at a constant rate (i.e., not strictly linear).
Spearman's coefficient is based on the ranks of the data rather than the actual values. This makes it suitable for ordinal data and less sensitive to outliers than Pearson's $r$.
Method for Calculating Spearman's $\rho$:
Step 1: Rank the values of the first variable (X) from 1 to $n$. Assign rank 1 to the smallest value, 2 to the next smallest, and so on. If there are tied values, assign each of them the average of the ranks they would have occupied.
Step 2: Rank the values of the second variable (Y) from 1 to $n$ in the same way.
Step 3: For each pair of observations, calculate the difference between the ranks: $d_i = \text{Rank}(x_i) - \text{Rank}(y_i)$.
Step 4: Square each difference: $d_i^2$.
Step 5: Sum the squared differences: $\sum d_i^2$.
Step 6: Apply the formula for Spearman's rank correlation coefficient (exact when there are no ties in the ranks, and a good approximation when ties are few):
$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
... (iii)
Where $n$ is the number of pairs of observations.
Interpretation of $\rho$ is similar to Pearson's $r$:
- $\rho = +1$: Perfect positive monotonic relationship.
- $\rho = -1$: Perfect negative monotonic relationship.
- $\rho = 0$: No monotonic relationship.
- Values close to +1 or -1 indicate strong monotonic correlation.
- Values close to 0 indicate weak monotonic correlation.
Spearman's $\rho$ is useful when the data is ordinal, when the relationship is monotonic but not necessarily linear, or when the data contains outliers. A more complex formula exists for handling a large number of ties.
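Steps 1 to 6 can also be carried out programmatically. The Python sketch below (illustrative only; the helper names and data are hypothetical) assigns average ranks to ties as described in Step 1 and then applies formula (iii); recall that formula (iii) is exact only when there are no ties.

```python
# Illustrative sketch of Spearman's rank correlation (Steps 1-6 above).
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                              # extend over a block of tied values
        avg_rank = (i + j) / 2 + 1              # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)                 # Steps 1 and 2
    d_squared_sum = sum((a - b) ** 2 for a, b in zip(rx, ry))   # Steps 3 to 5
    return 1 - 6 * d_squared_sum / (n * (n ** 2 - 1))           # Step 6, formula (iii)

# Hypothetical scores awarded by two judges to 5 candidates
print(spearman_rho([48, 60, 72, 62, 56], [62, 78, 65, 70, 38]))  # prints 0.5
```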
Correlation vs. Causation: A Crucial Distinction
One of the most important principles in statistics is that Correlation Does Not Imply Causation. Observing a strong correlation between two variables means they tend to change together in a predictable pattern, but it does not mean that one variable causes the other to change.
For instance, there might be a strong positive correlation between the number of ice cream cones sold and the number of drowning incidents at beaches. This doesn't mean that eating ice cream causes drowning. A third factor, like hot weather, is likely influencing both variables – hot weather increases ice cream sales and also increases swimming activity, leading to more potential for drowning incidents. This is an example of a spurious correlation (a correlation that appears to exist between two variables but is not a direct causal relationship).
Establishing causation requires more than just observing a correlation; it typically involves controlled experiments, careful study design, and consideration of other potential factors (confounding variables). Descriptive statistics and correlation coefficients are valuable tools for identifying potential relationships that warrant further investigation, but they cannot definitively prove cause and effect.
Example 1. Calculate Pearson's correlation coefficient ($r$) for the following data on the number of hours studied per week (X) and the marks obtained in an exam (Y) for 5 students:
Hours Studied (X) | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|
Marks (Y) | 50 | 60 | 70 | 80 | 90 |
Answer:
We are given $n=5$ pairs of observations for variables X (Hours Studied) and Y (Marks). We will use the computational formula for Pearson's $r$:
$\text{r} = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$
[Computational Formula for Pearson's r]
To use this formula, we need to calculate $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, and $\sum y^2$. Let's create a table to organize these calculations.
$x_i$ | $y_i$ | $x_i y_i$ | $x_i^2$ | $y_i^2$ |
---|---|---|---|---|
2 | 50 | 100 | 4 | 2500 |
3 | 60 | 180 | 9 | 3600 |
4 | 70 | 280 | 16 | 4900 |
5 | 80 | 400 | 25 | 6400 |
6 | 90 | 540 | 36 | 8100 |
$\sum x = 20$ | $\sum y = 350$ | $\sum xy = 1500$ | $\sum x^2 = 90$ | $\sum y^2 = 25500$ |
Now, substitute these sums and $n=5$ into the computational formula:
$\text{r} = \frac{5(1500) - (20)(350)}{\sqrt{[5(90) - (20)^2][5(25500) - (350)^2]}}$
Calculate the terms:
Numerator: $5 \times 1500 - 20 \times 350 = 7500 - 7000 = 500$
First term in denominator square root: $5 \times 90 - (20)^2 = 450 - 400 = 50$
Second term in denominator square root: $5 \times 25500 - (350)^2 = 127500 - 122500 = 5000$
Substitute these back into the formula:
$\text{r} = \frac{500}{\sqrt{(50)(5000)}}$
$\text{r} = \frac{500}{\sqrt{250000}}$
Calculate the square root:
$\sqrt{250000} = \sqrt{25 \times 10000} = \sqrt{25} \times \sqrt{10000} = 5 \times 100 = 500$
Substitute back into the formula for r:
$\text{r} = \frac{500}{500}$
$\text{r} = 1$
[Pearson's Correlation Coefficient]
The calculated Pearson's correlation coefficient is $r = 1$. This indicates a perfect positive linear correlation between the number of hours studied and the marks obtained for this particular dataset: every point lies exactly on the line $Y = 10X + 30$, so marks increase perfectly linearly with hours studied.
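As a quick cross-check (illustrative only, not part of the required working), the computational formula can be evaluated directly for this data in a few lines of Python:

```python
# Cross-check of Example 1 using the computational formula for Pearson's r.
from math import sqrt

x, y = [2, 3, 4, 5, 6], [50, 60, 70, 80, 90]
n = len(x)
num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = sqrt((n * sum(a * a for a in x) - sum(x) ** 2)
           * (n * sum(b * b for b in y) - sum(y) ** 2))
print(num / den)  # prints 1.0, matching the hand calculation
```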